


REPORT RESUMES

ED 017 810

SAMPLE-FREE TEST CALIBRATION AND PERSON MEASUREMENT. PAPER
PRESENTED AT THE NATIONAL SEMINAR ON ADULT EDUCATION RESEARCH
(CHICAGO, FEBRUARY 11-13, 1968).



TEST RELIABILITY, MODELS, ITEM ANALYSIS, TEST RESULTS, GEORG
RASCH MEASUREMENT MODEL,

OBJECTIVITY IN MENTAL TESTING REQUIRES THAT TEST 
CALIBRATION BE INDEPENDENT OF WHICH PERSONS ARE USED FOR THE 
CALIBRATION AND THAT PERSON MEASUREMENT BE INDEPENDENT OF
WHICH ITEMS ARE USED FOR THE MEASUREMENT. PRESENT PRACTICE IS 
NOT OBJECTIVE, BUT COULD BE SO, AS SHOWN BY THE EXAMPLE HERE 
PRESENTED. DATA COME FROM THE RESPONSES OF 976 LAW STUDENTS
TO 48 READING COMPREHENSION ITEMS ON THE LAW SCHOOL 
ADMISSIONS TEST. THE POSSIBILITY OF PERSON FREE TEST 
CALIBRATION IS DEMONSTRATED BY SHOWING THAT A CALIBRATION 
BASED ON THE RESPONSES OF A DUMB GROUP OF STUDENTS CAN BE 
NEARLY IDENTICAL WITH ONE BASED ON A SMART GROUP. THE 
POSSIBILITY OF ITEM FREE PERSON MEASUREMENT IS DEMONSTRATED 
BY SHOWING THAT ABILITY ESTIMATES MADE FROM SCORES ON AN EASY 
TEST CAN BE STATISTICALLY EQUIVALENT TO THOSE MADE FROM A 
HARD TEST. THE MEASUREMENT MODEL WHICH MAKES THIS OBJECTIVITY
POSSIBLE WAS DEVELOPED BY GEORG RASCH. IN THIS MODEL THE ODDS 
OF SUCCESS ON A TEST ITEM ARE HYPOTHESIZED TO BE GIVEN BY THE 
PRODUCT OF THE PERSON'S ABILITY AND THE ITEM'S EASINESS. IN 
ORDER TO FIT THIS MODEL ITEMS MUST BE CHOSEN OR CONSTRUCTED 
TO HAVE SIMILAR DISCRIMINATION. THE RESULTING MEASURES OF 
PERSON ABILITY AND ITEM EASINESS ARE ON A RATIO SCALE WITH A 
NATURAL ZERO AND A DEFINABLE UNIT. THIS PAPER WAS PRESENTED 
AT THE NATIONAL SEMINAR ON ADULT EDUCATION RESEARCH, CHICAGO, 
FEBRUARY 11-13, 1968. (AUTHOR/RT) 



BY- WRIGHT, BENJAMIN D. 



PUB DATE 28 OCT 67



EDRS PRICE MF-$0.25 HC-$0.96 22P.



DESCRIPTORS- *MEASUREMENT TECHNIQUES, TESTING, *INTELLIGENCE,







"PERMISSION TO REPRODUCE THIS
COPYRIGHTED MATERIAL HAS BEEN GRANTED
BY Benjamin D. Wright
TO ERIC AND ORGANIZATIONS OPERATING
UNDER AGREEMENTS WITH THE U.S. OFFICE OF
EDUCATION. FURTHER REPRODUCTION OUTSIDE
THE ERIC SYSTEM REQUIRES PERMISSION OF
THE COPYRIGHT OWNER."



U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
OFFICE OF EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM THE
PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE OF EDUCATION
POSITION OR POLICY.












SAMPLE-FREE TEST CALIBRATION AND PERSON MEASUREMENT

Benjamin D. Wright
Professor of Education
University of Chicago

Ever since I was old enough to argue with my pals over who had the
best I.Q. (I say "best" because some thought 100 was perfect and 60 was
passing), I have been puzzled by mental measurement. Even that noble
achievement, 100 per cent, is ambiguous. One hundred may signify the
welcome news that we are smart. Or it may just mean the test is easy.
Some students pray for easier tests to make them smarter.

We all know one way a test score can more or less be used. If
you are willing to accept as a whole the set of items making up a
standardized test, you can get a relative measure of ability. If your
performance puts you at the eightieth percentile among college men, you'll
know where you stand. Or will you? The same score will also put you at the
eighty-fifth percentile among college women, at the ninetieth percentile
among high school seniors and above the ninety-ninth percentile among high
school juniors. Your ability will depend not only on which items you take
but on who you are and the company you keep.

The truth is that a scientific study of changes in ability, of mental
development, is far beyond our feeble capacities to make measurements.
How can we possibly obtain quantitative answers to questions like: How
much does reading comprehension increase in the first three years of
school? What proportion of ability is native and what learned? or, What
proportion of mature ability is achieved by each year of childhood?

I hope I am reminding you of some problems which afflict present
practice in mental measurement. The scales on which ability is measured
are elusive and slippery. They have no logical zero point and no regular
unit. Their meaning and estimated quality depend upon the specific set of
[illegible] particular ability distribution of the [the rest of this
passage, at the top of page 2, is illegible in the source]



groups of children? Change the children and you have a new yardstick.
Change the items and you have a new yardstick. Each collection of items
measures an ability of its own. Each measure depends for its meaning on
its own family of test-takers. How can we make objective mental
measurements and build a science of mental development when we work with
rubber yardsticks?



Objectivity in Mental Measurement

The growth of science depends on the development of objective
methods for transforming observation into measurement. The physical
sciences are a good example. Their hallmark is the development of
methods for measuring which are specific to the measurement intended
and independent of variation in the other characteristics of the objects
measured or the measuring instruments used. When we want a physical
measurement we do not worry about the individual identity of the measuring
instrument. We do not concern ourselves with what objects other than the
one we want to measure might sometime be or once have been measured.
It is sufficient to know that the instrument is a member in good standing of
the class of instruments appropriate for the job.

When a man says he is at the ninetieth percentile in math ability,
we need to know in what group and on what test before we can make any
sense of his statement. But when I say I'm 5'11", do you ask to see my
yardstick? You know yardsticks differ in color, temperature, composition,
weight, even size. Yet you assume they share a scale of length in a manner
sufficiently independent of these secondary characteristics to give the
measurement 5'11" objective meaning. I may be at a different ability
percentile in group [illegible] I [illegible] 175 pounds in all of them.



















Let me call measurement that possesses this property "objective"
(Rasch, 1960, 1966a, 10[?]-105; 1966b, 56). Two conditions
are necessary to achieve it. First, the calibration of measuring instruments
must be independent of those objects that happen to be used for calibration.
We can't have the instrument changing every time we use it. Second, the
measurement of objects must be independent of which instrument happens
to be used for measuring. In practice these conditions can only be
approximated. But their approximation is what makes objective measurement
possible.

Object-free instrument calibration and instrument-free object
measurement are the conditions which make it possible to generalize
measurement beyond the particular instrument used, to compare objects
measured on similar but not identical instruments, and to combine or
partition instruments to suit new measurement requirements.



The guiding star toward which models for measurement should
aim is this kind of objectivity. Otherwise how can we ever achieve a
quantitative grasp of mental abilities or ever construct a science of mental
development? The calibration of test item easiness must be independent
of the particular collection of persons used for the calibration. The
measurement of person ability must be independent of the particular
selection of test items used for measuring.



* There is a third condition which follows from the first two. The
evaluation of how well a given set of observations can be transformed
into objective measurements must be independent of which objects and
which instruments are used to produce the observations. It must also
be reasonable to hypothesize that objects and instruments have stable
characteristics which do not interact with each other.

Were it useful to glue three twelve inch rulers together to make a
thirty-six inch yardstick or to saw a thirty-six inch yardstick in three
to make some twelve inch rulers, we would retain our confidence in
the objective meaning of length measurements made with the resulting
new instruments.









When we compare one item to another in order to calibrate items,
it should not matter whose responses to items we use for the comparison.
This means that our method for test calibration should give us the same
results regardless of whom we try the test on. This is the only way we will
ever be able to construct tests which have uniform meaning regardless of
whom we measure with them.

When we test a person it should not matter which selection of items
we happen to have found convenient to measure him with or which items he
happens to have found time to complete. We should be able to arrive at
statistically equivalent measurements of his ability, whatever selection
of items happens to have been used.

An Individualistic Approach to Item Analysis

Well, exhortations about objectivity and sarcasm at the expense of
present practices are well and good. But can anything be done about it?
Is there a better way?

In the old way of doing things we calibrate a test on a standard sample
of persons. Item easiness is defined by the proportion of correct responses
in the sample. Person ability is defined by percentile standing in the sample.
The approach leans entirely on the appropriateness of the standardizing
sample of persons.

A different approach is possible, one in which no assumptions are
made about the persons used. This approach assumes instead a very simple
model for what happens when any person encounters any item. The model
just says that the outcome of the encounter is governed by the product of
the ability of the person and the easiness of the item. That's all, nothing
more. The more able the person, the better his chances for success with
any item. The more easy the item, the more likely any person is to solve it.






This simple model has a surprising consequence for item analysis.
When measurement is governed by this model, it is possible to take into
account whatever abilities persons in the calibration sample happen to have
and to free the calibration of the test from the particulars of these abilities.
The scores persons obtain on the test can be used to remove the influence
of their abilities from the item analysis.
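
The two-item case gives a feel for why scores can remove the influence of ability. What follows is a minimal sketch in Python, assuming only the product model described above (odds of success equal ability times easiness). The function names and the numerical values are illustrative assumptions and are not taken from the paper or from its computer program. Among persons who solve exactly one of two items, the chance that it was the first item depends on the two easinesses alone.

```python
from fractions import Fraction

def p_success(ability, easiness):
    """Probability of success when the odds are ability * easiness."""
    odds = ability * easiness
    return odds / (1 + odds)

def p_first_given_one_right(ability, e1, e2):
    """P(item 1 right, item 2 wrong | exactly one of the two is right)."""
    p1, p2 = p_success(ability, e1), p_success(ability, e2)
    only_first = p1 * (1 - p2)
    only_second = (1 - p1) * p2
    return only_first / (only_first + only_second)

# Hypothetical easinesses for two items and three very different abilities.
e1, e2 = Fraction(3), Fraction(1, 2)
for z in (Fraction(1, 10), Fraction(1), Fraction(25)):
    print(z, p_first_given_one_right(z, e1, e2))
```

Each line prints the same conditional probability, e1 / (e1 + e2), whatever ability is supplied. Exact fractions are used so that the cancellation of ability is visible without rounding.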

I learned this kind of item analysis from Georg Rasch. But comparable
suggestions have been made by others. Some of the ideas have been in print
for years. I don't understand why this powerful method is not used in practice.
Perhaps too few recognize the importance of objectivity in mental
measurement. Perhaps too many despair that it can ever be achieved. Well,
it can, and I am here to prove it.

The crucial questions are: Can test calibration really be independent
of the ability characteristics of the persons used to make the calibration?
and Can person measurement, the estimation of a person's ability from a
score on some selection of test items, really be independent of which items
are used for the measurement?

I have placed some data in your hands which illustrate that both of
these ideals can be lived up to in practice. These data happen to come from
the responses of 976 beginning law students to 48 reading comprehension
items on the Law School Admission Test. But they are only one illustration.



Person-Free Test Calibration



In order to examine the dependence of test calibration on the abilities
of these law students let us construct the worst possible situation. Into a
Dumb Group, we will put the 325 students who did worst on the test. The
best of them got a score of 23. Into a Smart Group, we will put the 303
students who did best. The worst of them got a score of 33. We have two
groups dramatically different in their ability to succeed on this test of
reading comprehension. There are ten points difference between the smartest
of the Dumb Group and the dumbest of the Smart Group.



Now for the acid test. How would a test calibration based on the Dumb
Group compare with one based on the Smart Group? [illegible] Figures 1
and 2.

To remind us of how things look using the old way of doing things,
I made up these calibrations in terms of sample percentiles. [illegible] in
Figure 1 to represent person-bound test calibration. The curve on the left
is the calibration produced by the Dumb Group. The curve on the right is
the calibration produced by the Smart Group.

Obviously any person-bound calibration based on the Dumb Group is
going to be incomparable with one based on the Smart Group. From the
Dumb Group we can only set up percentile ability measures for students
who score between ten and twenty-three. From the Smart Group we can
only set them up for students who [illegible].
These calibrations do not even overlap. And what about all the scores
outside the range covered by either group?

Of course Figure 1 describes an exaggerated situation. No one in
his right mind would attempt to base a test calibration on two such different
groups. But this exaggeration has a purpose. It is aimed at bringing out
a [illegible] property of person-bound test calibration and at providing
an acid test for any method which claims to be person-free.

Now let us see how well the new way of test calibration handles this
exaggerated situation. I will not burden you with mathematical details. They
are covered in the references. Should you become interested in applying the
method, let me know. I have a dandy computer program which does it nicely,
and a technical write-up which describes every step. Let us look at the results.
Please examine Figure 1. Same data. Same test. Same students.



[FIGURE 1. Person-bound test calibration.]

[FIGURE 2. Test calibration. Horizontal axis: test score.]



In Figure 1 the x's mark the test calibration based on the Dumb Group.
The o's mark the calibration based on the Smart Group. Now, in Figure 2,
how different are the two calibration curves?



At this point you may have a question about how calibration curves
work to turn test scores into ability measurements. Each curve represents
a conversion table. When a person gets a score on the test, then you enter
the graph along the bottom at that score, look up vertically to a calibration
curve and then across to the left horizontally to read off his ability. In
Figure 1 you would read ability in group percentiles, if you could decide
which curve to use. In Figure 2 ability is expressed in logs. If you do
not like logs you can take the antilog and get an ability measure on a ratio
scale. This may interest you because ability is then measured on a scale
where zero means exactly no ability and for which a regular meaningful
useful unit can be defined.

In Figure 1 the calibration curves do not come close to each other.
In Figure 2 they are almost indistinguishable.* Would you say that the
difference between the two calibrations in Figure 2 was of practical
significance? How much would you care which of these calibration curves you
used to make the test a measuring instrument for you? And yet the two
groups on which they are based were constructed to make it as hard as
possible to achieve person-free test calibration.



* For a score of 15, the estimated log ability is about -1.0 and the
ratio scale ability is about 0.4. A score of 23 indicates a log ability
of about +1.0 and a ratio scale ability of about 2.7. Thus a score of
35 indicates about 7 times more ability than a score of 15.
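
The arithmetic behind this footnote can be checked in a few lines. This is only a sketch, assuming the logs are natural logarithms (the quoted ratio values 0.4 and 2.7 are consistent with that); the variable names are illustrative.

```python
import math

log_low, log_high = -1.0, 1.0          # log abilities quoted in the footnote

ratio_low = math.exp(log_low)          # antilog of -1.0
ratio_high = math.exp(log_high)        # antilog of +1.0
times_more = ratio_high / ratio_low    # equals exp(2.0)

print(round(ratio_low, 2), round(ratio_high, 2), round(times_more, 2))
```

A difference of two units on the log scale corresponds to a factor of about seven on the ratio scale, which is the comparison the footnote draws.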






There is a slight systematic difference. But this reading
comprehension test was taken as it stood without any modifications in favor
of fitting the item analysis model. When test items are chosen to
conform to the statistical requirements of the model then no systematic
differences between calibrations are discernible.







One thing that may puzzle you about Figure 2 is the range of test
calibration. Either calibration curve provides ability measures for all
raw scores on the test from 1 to 47. How can that be done when neither
group obtained more than a [illegible] of the scores possible?

The answer lies in the measuring model on which these calibration
curves are based. Remember that this model uses no assumptions about
the abilities of the calibration sample. Its only assumption is what happens
when any person encounters any item. Out of this assumption it is possible
to calibrate a test over its entire range of possible scores even when
everyone in the calibration sample happens to get exactly the same score.

That sounds impossible. But it follows directly from this new item
analysis model. The important idea is that even with the same total score
persons differ in which items they succeed on. When the calibration sample
is large these differences can be used to calibrate the items, and hence the
test over its entire range of possible scores, even though only one score
has actually been observed.
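
A small simulation makes this claim concrete. This is a hedged sketch and not the paper's computer program: it assumes the product model, uses a simple pairwise comparison of items (counting who solved item i but not item j) as a stand-in for whatever estimation method the author used, and all item values, the sample size, and the ability distribution are hypothetical. Only persons with one particular raw score are kept, yet the relative easiness of the items is still recovered.

```python
import math
import random

random.seed(1968)

true_d = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]   # hypothetical log easinesses
n_items = len(true_d)

def simulate_person(x):
    """One response vector under the model, with log odds = x + d."""
    return [1 if 1 / (1 + math.exp(-(x + d))) > random.random() else 0
            for d in true_d]

# A large sample of mixed ability, from which ONE raw score only is kept.
sample = [simulate_person(random.gauss(0.0, 1.5)) for _ in range(40000)]
one_score = [r for r in sample if sum(r) == 3]

# count[i][j] = persons who solved item i but failed item j.
count = [[0] * n_items for _ in range(n_items)]
for r in one_score:
    for i in range(n_items):
        for j in range(n_items):
            if r[i] == 1 and r[j] == 0:
                count[i][j] += 1

# log(count[i][j] / count[j][i]) estimates d_i - d_j; averaging over j
# gives d_i relative to the mean log easiness of the items.
est_d = [sum(math.log(count[i][j] / count[j][i])
             for j in range(n_items) if j != i) / n_items
         for i in range(n_items)]

for t, e in zip(true_d, est_d):
    print(f"true {t:+.2f}  estimated {e:+.2f}")
```

The comparison works because, among persons with the same total score, the ratio of "i right, j wrong" to "j right, i wrong" depends only on the two item easinesses under this model.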

How would you do that with the present methods of item analysis?
Comparing the calibrations shown in Figures 1 and 2, then, we can
see the contrast between the present way of doing things, calibration based
on the ability distribution of a standardizing sample, and a new way of doing
things, calibration which is free from the effects of the ability distribution
of the persons used for the calibration. Which do you prefer?*



Even though you used this new way as your basis for calibration you
could construct all the percentile standardizations you wanted.
Nothing would prevent you from embedding your ability measures in
as many sample contexts as you liked. But, and this is the vital point,
you would not be bound by those contexts. You would have an ability
measure which was invariant with respect to the peculiarities of the
persons used to establish the test calibration. If you were a test
manufacturer, you would not have to worry over whether you had
obtained the "right" standardizing samples to suit your customers. Your
test would be equally valid for all situations in which the test was
appropriate. At the same time, since the calibration was person-free,
you would be able to use new data as it came in to verify and improve
your calibration, to add to the item pool and to document the scope of
situations in which the test was functioning properly.






Item-Free Person Measurement

So much for person-free test calibration. Now, how about the
companion question. Can ability be measured in a fashion that frees it
from dependence on the use of a fixed set of items? Is item-free person
measurement possible? If a pool of test items has been calibrated on a
common scale can we use any selection we want from that pool to make
statistically equivalent ability measurements?

In order to judge whether person measurement can be independent
of item selection we want a situation that will make it as difficult as
possible for person measurement to be item-free. For this we will
divide the 48 items on the original test into two subtests of 24 items each
with no items in common between them.

It would be tempting to make these subtests equal in overall easiness.
Then they would be parallel forms. But that would be too tame to challenge
a scheme for item-free person measurement. Instead the two subtests
will be made as different as possible. The 24 easiest items will be used
to make an Easy Test. The 24 hardest items will be used to make a Hard
Test. Now, under these circumstances, what is the evidence that ability
measurement can be item-free? In other words, what is the evidence that
the ability estimates based on the Easy Test are statistically equivalent with
those based on the Hard Test?

Why do I say statistically equivalent? We know that there are a wide
variety of factors at work when a person takes a test. Even knowing a
person's ability and an item's easiness will not tell us exactly how he will
do on the item. At most we can say what his chances are. This uncertainty
follows through into his test score. Even if we could give a person the same
test twice, wiping all memory of the first exposure from his mind before
his second trial, we would not expect him to get the same score both times.
We know there will be some variation. This uncertainty is an inevitable
part of the situation. It is the error of measurement.








In finding out how item-free person measurement can be we must
make allowance for this. We cannot ask whether estimates of ability based
on the Easy Test are identical with those based on the Hard Test. But we
can ask whether the two estimates are close enough so their differences
are what we expect from the uncertainties in the testing situation. Are
they close enough in the light of their error of measurement to be considered
statistically equivalent?

To answer this question we will examine the test responses of the 976
law students to the 48 item test. The score each student earned on the whole
test can be split into a subscore on the Easy Test and a subscore on the Hard
Test. This gives each student a pair of independent scores each of which
should provide an independent estimate of his reading comprehension ability.
In order to convert these scores into ability measures on a common scale we
will calculate calibration curves like the one in Figure 2 for each of the
subtests. To do this we will use item calibrations on a scale common to all
48 items. Then the separate calibration curves for the Easy and Hard tests
will convert scores on these different tests into ability estimates on a common
scale. If the data fit the item analysis model, then the independent results
from these two different tests should produce statistically equivalent ability
estimates.






Table 1 



The data are in Table 1. The upper half of the table is an obvious example
of item-bound person measurement. The 976 law students average 6.78 points
more on the Easy Test than they do on the Hard one. How can two tests which
lead to such different scores be equated to yield comparable ability estimates?
This problem has been handled in the past by referring test scores
back through a percentile table based on some well chosen standardizing sample
who have taken both forms. That is one way to equate two tests that are







Table 1

ITEM-FREE PERSON MEASUREMENT

                                Test Score
                   Easy Test    Hard Test    Difference
Mean                 17.16        10.38         6.78
Std. Error            0.18         0.14         0.11
Std. Deviation    [illegible]      4.29         3.30

                           Estimated Log Ability
                                                          Standardized
                   Easy Test    Hard Test    Difference    Difference
Mean              [illegible]  [illegible]  [illegible]      0.003
Std. Error        [illegible]  [illegible]  [illegible]      0.032
Std. Deviation    [illegible]  [illegible]  [illegible]      1.014













supposed to measure the same ability. The trouble is that this equation
depends on the characteristics of the sample of persons used to equate the
tests. We know that an equation based on one group of persons is not in
general appropriate for equating measurements made on persons from
another group.

Is there a better way to equate tests? Can we go directly from a
test score and a person-free calibration of the test items to a measure
of ability that does not lean on any particular standardizing sample and
that is statistically invariant with respect to those calibrated items that
are actually used to obtain the score?

The lower half of Table 1 shows how the new approach equates the
Easy and Hard tests. For each person we have his score on the Easy Test
and his score on the Hard Test. For each score we look up the corresponding
estimated log ability on calibration curves like the ones in Figure 2. For
each pair of scores we obtain a pair of estimated log abilities. They will
not be identical. But how do they compare statistically?

The distribution of score differences with a mean of 6.78 and a
standard deviation of 3.30 is almost entirely above zero. But the
distribution of ability differences with a mean of .063 and a standard
deviation of .749 is nicely centered right at zero. On the average these
alternative estimates of ability seem to be aiming at the same thing.

How does the variation around zero compare with what would be expected
from errors of measurement alone? To examine this we will standardize
the differences in ability estimates. For each test score there is not only
a corresponding ability estimate but also the measurement error which
goes with that ability estimate. The difference between the Easy Test and
Hard Test ability estimates can be divided by the measurement error of this
difference to produce a standardized difference.
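
The standardization just described can be sketched as follows. The paper does not spell out the formula for the measurement error of the difference; this sketch assumes the two estimates have independent errors, so that their error variances add. The function name and the numbers are hypothetical and are not taken from Table 1.

```python
import math

def standardized_difference(x_easy, se_easy, x_hard, se_hard):
    """Difference in log ability divided by the standard error of that
    difference, assuming the two measurement errors are independent."""
    se_diff = math.sqrt(se_easy ** 2 + se_hard ** 2)
    return (x_easy - x_hard) / se_diff

# Hypothetical estimates for one student.
z = standardized_difference(x_easy=0.45, se_easy=0.3, x_hard=0.20, se_hard=0.4)
print(round(z, 3))
```

Computed for every student, values like this one should scatter around zero with a standard deviation near one if the two tests measure the same ability to within their errors.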







It is the distribution of these standardized differences that will show
us whether or not the two ability estimates are statistically equivalent.
If they are, then this standardized variable should have a mean of zero and
a standard deviation of one. That would mean that the only variation
observed in ability estimates was of the same [illegible] as that expected from
the error of measurement in the test. Table 1 shows that, for these 976
students, the standardized differences in ability estimates between the Easy
and the Hard tests have a mean of 0.003 and a standard deviation of 1.014.
Is that close enough to zero and one to suit you?

What does item-free person measurement mean for test constructors
and test users? If you can make statistically equivalent person measurements
from any selection of items you wish, then all the tricky and difficult problems
of equating parallel forms, connecting sequential forms, and relating short
and long forms disappear. Incomplete data ceases to be a problem. You
can measure a person with whatever items he takes.

Once you have developed a pool of items which conform to this item
analysis model and have calibrated these items, you are free to make
up any tests you wish out of any selection from this item pool. On the
basis of these item calibrations alone and without any further recourse to
standardizing samples you can compute a calibration curve or a table of
estimated abilities along with their errors of measurement for every possible
score on any subtest you want.
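
One way such a table could be computed is sketched below. It is not the author's program. It assumes that the ability estimate for a raw score is the log ability at which the expected score under the model equals that raw score, and that the error of measurement is one over the square root of the summed p(1 - p) terms; both are standard choices but are assumptions here. The item values are hypothetical log easinesses.

```python
import math

def expected_score(x, log_easiness):
    """Expected raw score at log ability x: the sum of success probabilities."""
    return sum(1 / (1 + math.exp(-(x + d))) for d in log_easiness)

def calibration_table(log_easiness):
    """For each raw score 1..L-1 return (score, log ability, standard error)."""
    table = []
    for score in range(1, len(log_easiness)):
        lo, hi = -20.0, 20.0
        for _ in range(80):                  # bisection on a monotone function
            mid = (lo + hi) / 2
            if score > expected_score(mid, log_easiness):
                lo = mid
            else:
                hi = mid
        x = (lo + hi) / 2
        info = sum(p * (1 - p) for p in
                   (1 / (1 + math.exp(-(x + d))) for d in log_easiness))
        table.append((score, x, 1 / math.sqrt(info)))
    return table

# Hypothetical calibrated pool, split into an easy and a hard subtest.
easy_items = [0.4, 0.8, 1.2, 1.6]
hard_items = [-1.6, -1.2, -0.8, -0.4]
for name, items in (("easy", easy_items), ("hard", hard_items)):
    for score, x, se in calibration_table(items):
        print(f"{name} test  score {score}  log ability {x:+.2f}  s.e. {se:.2f}")
```

The same raw score maps to a lower log ability on the easy subtest than on the hard one, and both tables are expressed on the one scale fixed by the item calibrations.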

All such abilities will be on the same ability scale whatever subset
of items they were estimated from. You can measure John on an Easy Test
and Jim on a Hard Test and be able to compare their resulting estimated
abilities on the same ratio scale. That means you can say how many times
more or less able John is than Jim in a precise quantitative and meaningful
way.

You can measure many children with a short test and a few with a longer
more precise test and have all the measures on the same ability scale. Think
of how this would expedite screening and selection procedures. The number
of items you gave a child could depend on how close he came to the point
of decision. Children far away on either side would be quickly detected
with a few items. Only children very near the decision point would require
longer tests in order to estimate more precisely on which side of the criterion
their ability lay.

You would let the required precision, the acceptable error of
measurement, determine test length. You would not be bound to any particular
predetermined set of items. You could select items from a calibrated pool
and compose test forms extemporaneously to suit your measurement needs.*
Yet all the measurements made with selections of items from this pool would
be located on one scale and used to define whatever norms you or your friends
desired. Indeed, since item analyses would be both person and item free, it
would be easy to construct tests so that all new data which came in could be
used directly to verify and improve item calibration, to add new items to
the item pool, to document the range of persons with whom the test was
functioning satisfactorily and to establish and extend ability norms for
whatever groups were being tested.



* The most important criterion for item selection is the magnitude
of measurement error. This is minimum when the person being
measured has even odds to succeed on the item. That means that
we would like to choose items just right for the person being measured,
items just as easy as the person is able. In individual or computerized
testing where it is possible to choose the next item on the basis of
information gathered from the person's performance up to that point,
this rule specifies exactly what item to use next.
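
The selection rule in this footnote can be sketched in a few lines. The sketch assumes the log form of the model, in which even odds means that log ability plus log easiness is zero; the pool, the item names, and the current ability estimate are hypothetical.

```python
import math

def p_success(x, d):
    """Probability of success at log ability x on an item of log easiness d."""
    return 1 / (1 + math.exp(-(x + d)))

def next_item(x_estimate, pool, used):
    """Pick the unused item whose success odds are closest to even."""
    candidates = [name for name in pool if name not in used]
    return min(candidates, key=lambda name: abs(x_estimate + pool[name]))

# Hypothetical calibrated pool: item name -> log easiness.
pool = {"a": 1.5, "b": 0.6, "c": -0.1, "d": -0.9, "e": -2.0}
choice = next_item(x_estimate=0.8, pool=pool, used={"d"})
print(choice, round(p_success(0.8, pool[choice]), 2))
```

After each response the ability estimate would be updated and the rule applied again, so the test adapts to the person being measured.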






A Simple Model for Measuring Ability Objectively



By now I hope I have whetted your appetite to know more about the
item analysis model which made these person-free test calibrations and
item-free person measurements possible. The measuring model contains
just two parameters. One of these belongs to the person and represents
the amount of his ability, $Z_n$. The other belongs to the item and represents
the degree of item easiness, $E_i$. The model combines these two parameters
to make a probabilistic statement about what happens when the person tries
the item.



Here is the measuring model: The odds in favor of success, $O_{ni}$,
are given by the product of the person's ability $Z_n$ and the item's
easiness $E_i$.*

$$O_{ni} = Z_n E_i$$



This is the same as saying that: The probability $P_{ni}$ that a person
with ability $Z_n$ will succeed on an item with easiness $E_i$ is the product
$Z_n E_i$ of his ability and the item's easiness divided by one plus this
product.**

$$P_{ni} = Z_n E_i / (1 + Z_n E_i)$$



This is the measuring model used to analyze the forty-eight reading
comprehension items on the Law School Admission Test.



* This can equally well be expressed in terms of log odds $L_{ni}$, log
ability $X_n$ and log easiness $D_i$ as

$$L_{ni} = \log O_{ni} = \log Z_n + \log E_i = X_n + D_i .$$

The log odds form brings out the simple linear structure from which
this model derives its optimal measuring properties.



** This can equally well be expressed in terms of the logistic function as

$$P_{ni} = 1 / (1 + \exp(-(X_n + D_i)))$$
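
The three ways of writing the model (odds, probability, and logistic form) can be checked against one another numerically. This is a minimal sketch; the function names and the ability and easiness values are illustrative assumptions.

```python
import math

def odds(z, e):
    """Odds of success: ability times easiness."""
    return z * e

def probability(z, e):
    """Probability of success from the ratio-scale parameters."""
    return z * e / (1 + z * e)

def probability_logistic(x, d):
    """Probability of success from log ability x and log easiness d."""
    return 1 / (1 + math.exp(-(x + d)))

z, e = 2.0, 0.25                      # hypothetical ability and easiness
x, d = math.log(z), math.log(e)       # the same values on the log scale
print(odds(z, e), probability(z, e), probability_logistic(x, d))
```

The two probability functions agree to within floating-point rounding, and the log of the odds equals x + d, which is the linear structure the footnote mentions.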






What does this simple model say about the scale on which person
ability and item easiness are measured? Odds vary from zero to infinity.
Since this model gives the odds in favor of success as the product of person
ability and item easiness, the natural scale on which to define ability and
easiness also varies between zero and infinity.

What does that mean? When a person has no ability then his zero
ability will give him zero odds in favor of success no matter what item he
tries. With no ability he has no chance of succeeding. On the other hand,
if an item has no easiness, then it is infinitely hard and no one can solve
it. Measurements made on these scales of ability and easiness have a
natural zero.

What about the unit of measurement? Reconsider the product of
person ability and item easiness. There is an indeterminacy in that product.
We can multiply ability by any factor we like and not change the product, as
long as we divide easiness by the same factor. This shows us that if we
want to make measurements, we will have to define a measurement unit.
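The indeterminacy can be written out in one line. The rescaling factor $c$ is a symbol introduced here only for illustration; it stands for any positive number:

$$(c\,Z_n)\left(\frac{E_i}{c}\right) = Z_n E_i = O_{ni} \qquad \text{for every } c > 0 .$$

The observable odds are the same for every choice of $c$, so the data alone cannot settle the size of the unit.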

How can such a unit be defined? One way is to select a special group
of items as standard. These items can be chosen on theoretical or normative
grounds. They can be chosen because they represent a minimal level of
ability or an optimal level. Once chosen the combined easiness of these items
is set at one. This calibration will then define a person's ability as his odds
for success on these standard items.

When a person is functioning at about the level of easiness of these items,
then his ability is about one. If he is below the level of these items, then
his ability is less than one. If in the course of development or education he
doubles his odds for success, that will mean he has doubled his measured
ability. Thus one way a unit of measurement can be defined is in terms of
even odds to succeed on items selected to be standard.
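As a rough sketch of how such a unit might be fixed in practice, the Python fragment below rescales a set of easiness values so that the standard items have a combined easiness of one. The text does not say how easiness values are to be combined; the geometric mean used here is an assumption, and the item names and numbers are hypothetical.

```python
import math

# Hypothetical easiness values on an arbitrary starting scale.
easiness = {"item_a": 0.5, "item_b": 2.0, "item_c": 8.0, "item_d": 4.0}
standard_items = ["item_a", "item_b", "item_c"]  # items chosen as standard
ability = 3.0  # one person's ability on the same arbitrary scale

# Assumption: "combined easiness" is taken to be the geometric mean
# of the easiness values of the standard items.
mean_log = sum(math.log(easiness[k]) for k in standard_items) / len(standard_items)
unit = math.exp(mean_log)

# Divide every easiness by the unit and multiply ability by the same factor.
calibrated_easiness = {k: e / unit for k, e in easiness.items()}
calibrated_ability = ability * unit

# The odds for every item are unchanged by this change of unit.
for k in easiness:
    assert math.isclose(ability * easiness[k],
                        calibrated_ability * calibrated_easiness[k])

print(calibrated_ability, calibrated_easiness)
```

On this assumed definition, a person whose calibrated ability is one has even odds, on the geometric average, of succeeding on the standard items.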












[...] persons. These persons can be chosen because they are typical, or because
they are minimal for some criterion or because they are the dumbest persons
you can find. Now the ability unit is the ability of these standard persons.
If you are just at their standard then your ability is one. If your odds to
succeed on any item are twice those of a standard person then your ability
is two.

In our exploration into what zero means and how to define a unit of
measurement we have uncovered the sense in which measures made with
this item analysis model are on a ratio scale. When one item is twice as
easy as another, then any person's odds for success on the easier item
are twice his odds for success on the harder one.
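The ratio property follows directly from the product form of the model. Using $j$ for a second item and $m$ for a second person (indices introduced here only for illustration):

$$\frac{O_{ni}}{O_{nj}} = \frac{Z_n E_i}{Z_n E_j} = \frac{E_i}{E_j} , \qquad \frac{O_{ni}}{O_{mi}} = \frac{Z_n E_i}{Z_m E_i} = \frac{Z_n}{Z_m} .$$

The person's ability cancels from the first ratio and the item's easiness cancels from the second. So if $E_i = 2E_j$, then $O_{ni} = 2\,O_{nj}$ for every person $n$.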

Finally, and most important, this simple item analysis model has
a mathematical property which is vital to objectivity in mental measurement.
When observations are made in terms of dichotomies like right/wrong,
success/failure, then it is a mathematical fact that this is the only model
which leads both to person-free test calibration and to item-free person
measurement. When observations are dichotomous, the simple form of
this item analysis model is the sufficient and necessary condition for
objective mental measurement.

Test Construction and the Future of Item Analysis

What bearing does this model for measuring ability objectively have
on the construction of mental tests? The model is so simple that those of
you who have worried about how to do item analysis may cry out, "What
about guessing? What about item discrimination? What about the influence
of one test item on another?"



It is obvious that in any real testing situation all of these factors play
a part. But rather than "What about them?" I prefer to ask, "What do we
want to do with them?"









We can construct tests in which guessing plays a big part, in which
items vary widely in their discrimination and in which the answer to one
item prepares for the next. But do we want to? Not if we aspire to objective
mental measurements. If we value objectivity we will employ our test
constructing ingenuity in the opposite direction.

If we use multiple choice items, we will devise distractors that make
guessing infrequent, and we will select items easy enough so that the
motivation to guess is slight. When we pilot study the characteristics of
potential items, we will select items for the final pool which discriminate
equally and fit an objective measuring model.



Most item analysis models use at least two parameters to describe
items. In addition to the item easiness which is part of the simple
model presented here, there is also item discrimination. It
represents the item's power to magnify or attenuate the extent to
which ability is expressed. The discovery of item discrimination
was an important step toward understanding [...].
But as a parameter in the final measuring model it is fatal to
objectivity.

If item discrimination is allowed to remain as an active parameter
in the measuring model, if variation in item discrimination is
tolerated in the final pool of test items, then the possibility of
person-free test calibration is lost.
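The loss can be illustrated numerically. The paper gives no formula for a model with discrimination, so the sketch below assumes one common two-parameter form in which the log odds are $a_i (X_n + D_i)$, with $a_i$ a hypothetical discrimination parameter. All values are invented for illustration.

```python
def log_odds(x, d, a=1.0):
    """Log odds of success for log ability x and log easiness d.

    The discrimination a is an assumed extra item parameter;
    a = 1.0 gives back the simple model of this paper.
    """
    return a * (x + d)

persons = [-1.0, 0.0, 2.0]      # hypothetical log abilities
d_easy, d_hard = 1.0, -0.5      # hypothetical log easiness of two items

# Equal discrimination: the easy-minus-hard difference is the same
# for every person, so items can be compared whoever is tested.
equal = [log_odds(x, d_easy) - log_odds(x, d_hard) for x in persons]

# Unequal discrimination: the difference now depends on the person,
# so the comparison of items changes with the calibrating sample.
unequal = [log_odds(x, d_easy, a=2.0) - log_odds(x, d_hard, a=0.5)
           for x in persons]

print(equal)
print(unequal)
```

With equal discrimination the three differences coincide. With unequal discrimination they differ from person to person, which is the sense in which calibration stops being person-free under this assumed form.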

It may be useful to estimate item discrimination when constructing
an item pool in order to bring it under control through item selection.
But there are more general statistical tests for whether an item
or a set of items fit this simple item analysis model. These more
general tests are more generally useful.






You might complain that this nice advice is impossible to follow.
Do not despair. The reading comprehension items on the Law School
Admission Test were not constructed for equal discrimination or item
independence. They are multiple choice items with five alternatives.
They differ considerably in discrimination and they are grouped around
common paragraphs of text to be read for comprehension. Yet the simple
item analysis model without guessing, without discrimination and assuming
item independence succeeded quite well even with these unfit data. This
shows that the measuring model is robust with respect to departures from
its assumptions. We do not have to create a perfect test in order to use
the model. Nevertheless, if we are really interested in objective mental
measurement, then the ideals of no guessing, equal discrimination and
item independence can guide us toward constructing better tests. And
the kind of item analysis I have illustrated can transform observations
made with these tests into objective mental measurements.




















